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ABSTRACT 

Summary: We developed MolBioLib to address the need for adapt- 
able next-generation sequencing analysis toois. The result is a 
compact, portable and extensively tested C-|~|-1 1 software framework 
and set of applications tailored to the demands of next-generation 
sequencing data and applicable to many other applications. 
MolBioLib is designed to work with common file formats and data 
types used both in genomic analysis and general data analysis. A 
central relational-database-like Table class is a flexible and powerful 
object to intuitively represent and work with a wide variety of tabular 
datasets, ranging from alignment data to annotations. MolBioLib has 
been used to identify causative single-nucleotide polymorphisms in 
whole genome sequencing, detect balanced chromosomal rearrange- 
ments and compute enrichment of messenger RNAs (mRNAs) on 
microtubules, typically requiring applications of under 200 lines of 
code. MolBioLib includes programs to perform a wide variety of 
analysis tasks, such as computing read coverage, annotating gen- 
omic intervals and novel peak calling with a wavelet algorithm. 
Although MolBioLib was designed primarily for bioinformatics pur- 
poses, much of its functionality is applicable to a wide range of prob- 
lems. Complete documentation and an extensive automated test suite 
are provided. 

Availability: MolBioLib is available for download at: http://sourceforge 

. net/project s/molbioli b 

Contact: ohsumit@molbio.mgh.harvard.edu 

Received on May 16, 2012; revised on July 12, 2012; accepted on 
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1 INTRODUCTION 

Next-generation sequencing requires a data analysis approach 
capable of handling large, complex and varied datasets, from 
large sets of reads to complex polymorphisms to existing feature 
files. In addition, the competitive nature of research demands 
rapid development of methods that are flexible enough to inte- 
grate new and quickly evolving algorithms. Tools have been de- 
veloped to address these needs, such as GATK (McKenna et al., 
2010). However, packages written in Java (e.g. GATK) require 
the maximum memory heap space to be specified at run time 
(Oracle, 2011), limiting how the input data are formatted and 
handled. For example, a coverage program would require more 
memory to compute coverage of a query-ordered SAM file 
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versus a position-ordered SAM file, because a sUding window 
of coverage cannot be used. Programs written in C++ do not 
require the heap size to be specified and are only limited by the 
amount of available memory. Other packages written in C++ 
have their strengths, but they also have limitations that suggest a 
niche for our software, MolBioLib. Arachne (Batzoglou et al, 
2002; Jaffe et al, 2003), the .NET Bio project by Outercurve 
Foundation (Outercurve, 2012) and NCBTs C++ Toolkit 
(Vatakov, 2012) provide many functions, but are not compact 
and do not always clearly identify the primary objects. 
Furthermore, the .NET Bio project is specific to the Windows 
environment (Mercer, 2012) and Arachne is specific to a particu- 
lar Linux environment. IBM's GenomicTools (Tsirigos, 2012) 
has many very useful tools, but addresses common bioinfor- 
matics tasks at a lower level than MolBioLib, such as providing 
command-line tools rather than a unified program to generate 
ChlP-seq output. Other packages, such as Bio++ (Dutheil, 
2008), libsequence (Thornton, 2003) and TIGR++ (Majoros, 
2012), are targeted toward specific applications and not designed 
to provide breadth of functionaUty. The package that most clo- 
sely resembles MolBioLib's philosophy is SeqAn (Doring, 2008), 
though it is written in an older version of C++ and thus does not 
take advantage of the variadic templates or other modern fea- 
tures of C++11 (ISO/IEC, 2011). 

MolBioLib fills the need for a platform-independent, exten- 
sively tested, compact and efficient C++ 1 1 library and an exten- 
sive set of bioinformatics applications that can be used to analyze 
data and rapidly develop new tools. MolBioLib's library includes 
a variety of useful objects and functions, such as a relational- 
database-like object, a text file reader object that simplifies data 
input, statistical functions and peak calling methods that can 
operate on any array of values, such as per base sequence cover- 
age. In addition, MolBioLib includes a broad range of tools, 
such as to generate coverage, hits of reads to features and 
ChlP-seq, all in one unified package. 

The design of MolBioLib is based on four principles. The first 
is to simplify bioinformatics programming in C++11, achieved 
by developing a library that includes many common bioinfor- 
matics tasks. For example, C++ 1 1 requires programmers to 
write specialized data structures to sort associated data keeping 
them together, such as feature information associated with a 
position. Additionally, to iterate either sequentially or randomly 
through a tab-separated-values (TSV) file and select values from 
specific columns would require the creation of a function to split 
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a line on tabs and constructs to index and traverse a text file. 
These, and many other common tasks, are built into MolBioLib, 
thus greatly simplifying the code one needs to write. It is hoped 
that MolBioLib will allow bioinformaticians to consider C++ II 
as a possible language of choice. Second, MolBioLib is efficient. 
C++ II is used because it is the new standard that intro- 
duces constructs for making objects such as Table. C++ 11 is 
efficient since it is a compiled language with no inherent restric- 
tion on memory heap size at run time. Templates are used ex- 
tensively to compact code, avoid inefficient virtual table lookups 
and maintain type safety. Objects and method parameters are 
often templated so that they may be in-lined by the compiler. 
Third, MolBioLib promotes clarity and compactness by conso- 
lidating common operations into a concise set of objects. We also 
provide an extensive library of functions that are not intrinsic to 
one object, such as those that convert one data type to another, 
e.g. splitString converts a string to a vector<string>. 

Given the range of problems MolBioLib addresses, the source 
code is compact: ~ 10 000 lines of code and comments for the 
core objects and functions. Among the 101 included applications, 
86% are coded in fewer than 200 lines and 59% in fewer than 
100 lines. In contrast, without such a framework, the user would 
have to code the thousands of lines of code to reproduce 
MolBioLib's functionality. Finally, MolBioLib is extensively 
tested and facilitates easy testing and debugging of its applica- 
tions. Automated tests are provided for all objects and functions. 
Additional validation of the code base comes from extensive 
application of MolBioLib to many molecular biology projects 
(Lau ef al, 2009; Sharp et ciL, 2011; Talkowski et al., 2011; 
Zhao et al, 2010; Raif S. Geha, manuscript in preparation). 
To simplify use of MolBioLib, all libraries are include files fol- 
lowing the Boost convention (Schaling, 2011). Debugging and 
memory checking is thus faciUtated with tools such as with 
Valgrind (Nethercote and Seward, 2007; Seward and 
Nethercote, 2005) since applications in MolBioLib consist of a 
main program file with many include files. Additional input and 
programming checks are incorporated into the framework 
through optional compiler flags. 

2 METHODS 

MolBioLib is hierarchically structured for ease of use. It contains three 
main components: the library consisting of a set of objects and functions, 
the set of applications and the documentation. 

// First, read in the ref .fasta file 

// into ref Sequence. 

Fasta ref Sequence ( " " ref . fasta' ' ) ; 

// The Fasta object makes it easy to apply 
//a function to a subset of sequences based 
// on, for instance, header text. 
// Define a wrapper function that 
// reverse-complements a sequence: 
void f (Sequence & seq) { 

seq. selfReverseComplement ( ) ; } 

A script is included to compile all the external packages [such as 
BAMTools (Barnett et al, 2011)], applications and optionally build 
and run tests. MolBioLib can be used independendy of the external pack- 
ages and interfaces. The documentation for all of MolBioLib may be 
generated automatically using the included Doxygen configuration file 



(van Heesch, 2011). The introductory pages of the Doxygen output 
show how to compile and use MolBioLib both as a set of tools as well 
as a programming framework. Functions that transform one data type to 
another are separated from the objects. Finally, the applications are hier- 
archically organized by usage type. 

Table<string, string, size_t> myTablel , 

myTable2 ; 

Table<string, string, size_t, string, 

string, size_t> inner JoinedTable ; 
readTSVTable (myTablel , ' ' tableData . tsv' ' ) ; 
string si, s2; size_t 11; 
// Below gets the values from the 5th row. 
myTablel. getRow (4, si, s2, 11); 
/ / Below adds data to myTable2 , much like 
// the vector push_back operator. 
myTable2 . push_back ( ' ' some string 1 ' ' , 
" " string 2 ' ' , 5 ) ; 
// Fill myTable2 similarly. . . 
// Bellow inner joins the two tables on the 
/ / first column of both myTable ' s and 
// stores the result in inner JoinedTable 
inner JoinTables<0 , 0> (myTablel, myTable2, 
inner JoinedTable) ; 
// Below sorts the inner JoinedTable 
// on the 6th coluirai. 
sortTable<5> ( inner JoinedTable) ; 
writeCSVTable ( inner JoinedTable , 

" " iimer JoinedTable . outputFile . CSV' ') ; 

Several novel classes power rapid development with MolBioLib. The 
primary object that stores data in MolBioLib is Table, which is a con- 
tainer class similar to the C++ STL vector class except that each column 
may store a different data type, through C++irs variadic templates. 
Variadic templates allow the coding of objects that can accept an arbi- 
trary number of template parameters. However, writing variadic tem- 
plated objects and functions can become cumbersome (Gregor and 
Jarvi, 2008; Kalev, 2008). Therefore, MolBioLib includes the Table 
object to represent tabular data, a mainstay of bioinformatics data ex- 
change, in an intuitive and easily used fashion. The Table structure was 
based on the relational database (Codd, 1970) model, where related data 
are stored row-wise. Column data types are specified through the tem- 
plate parameters. A Table may be thought of as a generalized vector. 
It includes data row insertion and retrievals operations that are simple to 
use. Database-like operations for Table, such as concatenation, filtering 
and inner and outer joins, are provided. Example usage of a Table is: 

where the readTSVTable is a function to read a TSV file into a table. 
This tabular grouping of data can be used for many bioinformatics tasks. 
One example is the Fasta object, derived from the Table object, which 
stores sequences and their headers. The Fasta object simplifies access to 
sequence data: 

The primary object to read text files is ReadOnlyStringFile. The 
class automatically creates an index of a file, if not already present, so the 
file may be accessed as if it were a string array. The index file is created by 
going through the text file once and noting the starting file position of 
each line in the index file. The index file itself has a fixed length per line, 
simplifying the process of finding the index position. Thus, to access a line 
in the text file, the appropriate line in the index file is looked up. 
Subsequently, the line starting at the file position indicated by the look 
up is read in. Almost no memory is required in using the index. A 
ReadOnlyStringFile object functions like an array in which each 
element is one row of the file. ReadOnlyTSVFile is a particularly 
useful derived class of ReadOnlyStringFile. The values of each 
tab-separated field of each row can be accessed by the operator[] 
method, returning a vector containing the parsed elements of one row. 
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Applications written with MolBioLib capitalize on the abihty of the 
ReadOnlyStringFile class to hide all the housekeeping chores 
involved in parsing data from delimited text files. 

For example, to sample a random subset of a TSV input file, one 
would code: 

If a file only needs to be traversed once, sequentially a line at a time, 
ReadOnlyStringFile can traverse the file without creating an index 
file. This eliminates the time to build and store an index. 

The ReadOnlySeguencesFile, based on the 
ReadOnlyStringFile, is a FASTA/FASTQ reader object. It can 
work in a random access mode or sequentially traverse the file, providing 
all the read-only operations of the Fasta object, thus greatly simplifying 
access to sequence read files. 

/ / Then apply function f to all sequences 
/ / whose header does not have the string 
// " "chr_Un' ' in it (false means not matches) : 
ref Sequence . applyToHeaderRegex ( 
" 'chr_Un' ' , f , false) ; 

Other general objects in MolBioLib include a sparse vector object, e.g. 
for use in storing sparse coverage, a map facilitator between keys and 
rows in a Table, and a parameterized type interval object with asso- 
ciated overlap and set functions. A random number generator class that 
includes a permutation vector is also provided. Other bioinformatics ob- 
jects included are an alignment object, a Sequence class with operations 
such as reverse-complementing, a feature object and a peak object for 
storing local extrema of numeric data. All of these classes have been used 
to simplify coding of novel bioinformatics analyses. 

ReadOnlyTSVFile f ileArray ( ' ' input . tsv' ' ) ; 

size_t numLines = f ileArray . size ( ) ; 

/ / RandomLib below is part of MolBioLib 

/ / The numbers are seeds . 

RandomLib randf (1802 , 9373); 

for (size_t i = 0; i<100; -|-|-i) { 

size_t randRowNum = 

randf . randomSize_t ( 0 , numLines-1) ; 

vector<string> tokens = 

f ileArray [randRowNum] ; 

// Process tokens here. . . 

} 

f ileArray . close ( ) ; 

Functions included in MolBioLib are diverse, intended to cover four 
broad categories: algorithms, file readers and writers for various data 
types, system utilities and transformation of data types to different 
data types. Algorithm functions include Table functions, peak detection 
and statistics. Table functions include smoothing values, sorting and 
upper_bound and lower_bound of Tables analogous to their 
vector class counterparts. File reader and writers include those for 
various alignment formats, including SAM (Li et al, 2009), Helicos 
Biosciences" TXT format (Helicos, 2010), NCBI tabular BLAST 
output (Madden, 2003), feature readers for tabular files such as for 
UCSC's refGene annotation table (refGene.txt; Kent, 2012), Ensembl's 
Biomart in TSV format (Flicek el al., 201 1; Smedley el ai, 2009) and the 
OFF format (Wellcome, 2012), streams for intervals and vectors, tuple 
streams and writers and Table reader and writers. Peak detection func- 
tions for numeric arrays will be discussed in more detail in the ChlP-seq 
section. 

System utilities include a powerful command-line parsing system. 
Various command-line argument types are provided, including numeric 
and string. Input and output file name argument types provide file check- 
ing to ensure all input files are present at run time and to prevent acci- 
dental overwrites of output files. Furthermore, the command-line parser 
automatically records the date and time the program was compiled as 



well as the command line to simplify documentation of computational 
steps and pipelines. System utilities also provide functions to transform 
one data type to another, such as string conversions to and from various 
data types as well as string splitting (on one or more delimiters). 

The compiler of choice for MolBioLib is clang-|— I- version 3.0 
and above (Clang, 2012) using the associated libc-|— I- library and is avail- 
able on the Linux, Mac OS X and MS Windows platforms. It supports a 
large subset of C++11, has very good compiler error messages and 
is efficient. MolBioLib also works with GNU g++ 4.7 and above 
(Gcc, 2012). 



3 RESULTS 

One of the primary goals of MolBioLib is to provide a set of 
programs that address the most common bioinformatics ana- 
lyses. Here we describe applications in MolBioLib that address 
four common bioinformatics analyses: annotating a list of fea- 
tures, counts of hits to features, coverage and ChlP-Seq. We also 
touch upon additional useful utilities included in MolBioLib. 

The MolBioLib application addFeaturesToTSVFile per- 
forms the very common task of adding gene annotations or more 
generically 'features', to an input file in which each row describes 
a genomic interval. Examples of annotations include the 
refGene.txt file downloadable from UCSC's genome browser 
site (Fujita et al., 201 1) that contains the gene ID, name, chromo- 
some, strand, and start and stop positions of the transcript. 
Other annotation files include tabular data from the Ensembl/ 
Biomart website (Flicek et al., 2011; Smedley et al., 2009), where 
one can download any set of genes with user-selected attributes 
such as IDs, names, positions, expression data and protein 
domain. There are numerous other annotation sources, many 
of which consist of carefully curated private data, on a topical 
website, or in a published supplement to a journal article. 
Other common tabular formats include the BED, 
PSL and GFF formats (Kent, 2012; Wellcome, 2012). Using 
addFeaturesToTSVFile, the genomic footprint of any such 
annotation can be intersected with another TSV file containing 
genomic intervals. The application will use the genomic interval 
specified on each row of the input file and find all intersecting 
feature coordinates (with matching strand, if specified) 
and add the appropriate annotation(s) to the row in the 
output. Importantly, this application will take in any input 
TSV as well as any annotations in TSV form (such as 
those noted above) and thus may be used on a wide variety of 
projects. 

Another common bioinformatics task is to count the number 
of alignments mapping to a set of features in a TSV file, such as 
refGene.txt, promoter regions or classes of sequence repeats. For 
example, we have used this method to quantify the number of 
reads derived from genie regions, different classes of genomic 
repeats and from different classes of non-coding RNAs (Sharp 
et al., 20II). ref FeaturesAnalysis offers a number of op- 
tions, such as shifting the positions of features (e.g. probing hits 
to upstream UTRs instead of the genes), filtering for a specific 
set of reads, maximum error in alignments and RPKM normal- 
ization (Mortazavi et al., 2008). The input features are printed 
along with the count of alignments overlapping each input fea- 
ture and meeting any filtering criteria. 
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Sequence coverage can be used for determining the number of 
reads mapped to a base or region and also for finding 
polymorphisms. MolBioLib has a unique coverage tool that out- 
puts strand-specific statistics as well as a count of mismatched 
bases observed at each position. Coverage can be run on a 
user-defined set of regions and normalized to the number of 
million reads in the input reads file. Additionally, coverage can 
be executed using only the midpoint position to identify where 
the reads align. Finally, coverage can be run using uniquely 
aligning reads. Programs are provided to post-process coverage 
output in various ways, such as multiplying by some constants 
(either for the whole file or by contig/chromosome), windowing 
the coverage and converting the coverage to a wiggle file. 
Coverage output is the input for the ChlP-Seq analysis program 
discussed next. 

An experimental ChlP-Seq program with a novel wavelet 
method is our final example of application of the MolBioLib 
library. Peak detection is one major next-generation sequencing 
application. Given an ahgnment file, this algorithm finds the 
coverage on each strand at each location, computed either per 
base or in bins spanning a user-defined number of bases. This is 
done for both the sample as well as for a background sample or 
control, and coverage is smoothed using a kernel smoother. One 
challenge of peak detection is finding peaks having a wide range 
of widths and heights. We address this by applying a 
translation-independent wavelet smoother applied at various 
scales, finding local peaks at each scale, and then ranking puta- 
tive peaks by using ridge lines that identify peaks detected across 
multiple scales (Du et ah, 2006). Peaks with a longer ridge length 
are more isolated from other peaks, because they show up as 
peaks at various length scales. Optional filters remove low signals 
and spikes. This peak detection method is included in 
MolBioLib. Other smoothers, such as Gaussian, are also imple- 
mented in MolBioLib. 

We validated the wavelet peak detector on published 
H3K4me3 ChlP-Seq data (Myers et ah, 2011). As expected, 
H3K4me3 peaks are enriched in promoter regions (Zhao et al., 

2007) . The percentage of peaks in promoter regions was 57% 
compared to 33% using the window tag density method 
(Kharchenko et ai, 2008) and 53% using MACS (Zhang et al, 

2008) . Peaks were well defined, with a mean width of 232 bp 
compared to 2840 bp and 654 bp for the other two methods 
tested on the same data. 

There are many other applications included with MolBioLib 
that address much of the essential bioinformatics analyses done 
in next-generation sequencing projects. Some of the more 
common tasks include computing statistics on a list of numbers, 
creating histogram files from data files (both numeric and string), 
converting alignment formats. Additional tasks include common 
operations on FASTA and FASTQ files, such as obtaining a 
subset, trimming and removing duplicate reads. Moreover, 
there are programs to combine, print subsets and inner join 
TSV files. Intersection, subtraction and union operations of 
text files are also included. 

4 DISCUSSION 

MolBioLib fills the need for an efficient, reliable and compact 
C++ 1 1 bioinformatics framework. It is portable across many 



platforms and aligner formats and is fully documented. 
MolBioLib is unique in offering complete analysis programs 
for a variety of other very common tasks not addressed by 
other toolkits, from feature hit counts to coverage to ChlP-Seq. 

MolBioLib classes offer considerable power and convenience 
for creating novel analysis applications. A central and very gen- 
eral Table class simulating the functionality of a database eases 
construction of many programs. The Table class is based on a 
collection vectors, thus having a small memory overhead com- 
pared to other data structures such as a map. Capacity for larger 
datasets is only limited by the amount of available memory. 
File readers provide efficient methods to perform ubiquitous 
file I/O tasks. These classes will have general utility for applica- 
tion development beyond the specific needs of computational 
biology. 

As MolBioLib gains adoption, we aim to incorporate many of 
the applications both user-contributed and those developed for 
our projects into the main distribution through the SourceForge. 
net code repository mechanism. 
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